Abstract

Deep convolutional neural networks (DCNNs) have attracted much attention recently, and have been shown to recognize thousands of object categories in natural image databases. Their architecture is somewhat similar to that of the human visual system: both use restricted receptive fields and a hierarchy of layers that progressively extract more and more abstract features. Yet it is unknown whether DCNNs match human performance at the task of view-invariant object recognition, whether they make similar errors and use similar representations for this task, and whether the answers depend on the magnitude of the viewpoint variations. To investigate these issues, we benchmarked eight state-of-the-art DCNNs, the HMAX model, and a baseline shallow model, and compared their results to those of humans with backward masking. Unlike in all previous DCNN studies, we carefully controlled the magnitude of the viewpoint variations to demonstrate that shallow nets can outperform deep nets and humans when variations are weak. When facing larger variations, however, more layers were needed to match human performance and error distributions, and to have representations that are consistent with human behavior. A very deep net with 18 layers even outperformed humans at the highest variation level, using the most human-like representations.

Introduction 

Primates excel at view-invariant object recognition [1]. This is a computationally demanding task, as an individual object can lead to an infinite number of very different projections onto the retinal photoreceptors while it varies under different 2-D and 3-D transformations. It is believed that the primate visual system solves the task through hierarchical processing along the ventral stream of the visual cortex [1]. This stream ends in the inferotemporal cortex (IT), where object representations are robust, invariant, and linearly separable. Although there are extensive within- and between-area feedback connections in the visual system, neurophysiological, behavioral [5], and computational [6] studies suggest that the first feed-forward flow of information (~100-150 ms post-stimulus presentation) might be sufficient for object recognition and even invariant object recognition.

Motivated by this feed-forward information flow and the hierarchical organization of the visual cortical areas, many computational models have been developed over the last decades to mimic the performance of the primate ventral visual pathway in object recognition. Early models comprised only a few layers, while the new generation, called “deep convolutional neural networks” (DCNNs), contain many layers (8 and above). DCNNs are large neural networks with millions of free parameters that are optimized through an extensive training phase using millions of labeled images. They have shown impressive performance in difficult object and scene categorization tasks with hundreds of categories. Yet the viewpoint variations were not carefully controlled in these studies. This is an important limitation: in the past, it has been shown that models performing well on apparently challenging image databases may fail to reach human-level performance when objects are varied in size, position, and, most importantly, 3-D transformations. DCNNs are position invariant by construction, thanks to weight sharing. However, for other transformations such as scale, rotation in depth, rotation in plane, and 3-D transformations, there is no built-in invariance mechanism. Instead, these invariances are acquired through learning. Although the features extracted by DCNNs are significantly more powerful than their hand-designed counterparts like SIFT and HOG, they may have difficulty handling 3-D transformations.

To date, only a handful of studies have assessed the performance of DCNNs and their constituent layers in invariant object recognition. In this study we systematically compared humans and DCNNs at view-invariant object recognition, using exactly the same images. The advantages of our work with respect to previous studies are: (1) we used a larger object database, divided into five categories; (2) most importantly, we controlled and varied the magnitude of the variations in size, position, in-depth and in-plane rotations; (3) we benchmarked eight state-of-the-art DCNNs, the HMAX model (an early biologically inspired shallow model), and a very simple shallow model that classifies directly from the pixel values (“Pixel”); (4) in our psychophysical experiments, the images were presented briefly and with backward masking, presumably blocking feedback; (5) we performed extensive comparisons between different layers of DCNNs and studied how invariance evolves through the layers; (6) we compared models and humans in terms of performance, error distributions, and representational geometry; and (7) to measure the influence of the background on the invariant object recognition problem, our dataset included both segmented and unsegmented images.

This approach led to new findings: (1) deeper was usually better and more human-like, but only in the presence of large variations; (2) some DCNNs reached human performance even with large variations; (3) some DCNNs had error distributions that were indiscernible from those of humans; (4) some DCNNs used representations that were more consistent with human responses, and these were not necessarily the top performers.

Materials and methods 

Deep convolutional neural networks (DCNNs)

The idea behind DCNNs is a combination of deep learning with convolutional neural networks [2]. DCNNs have a hierarchy of several consecutive feature detector layers. Lower layers are mainly selective to simple features while higher layers tend to detect more complex features. Convolution is the main process in each layer and is generally followed by complementary operations such as max pooling and output normalization. Up to now, various learning algorithms have been proposed for DCNNs, and among them the supervised learning methods have achieved stunning successes [22]. Recent advances have led to the birth of supervised DCNNs with remarkable performance on extensively large and difficult object databases such as Imagenet. We selected the eight most recent and powerful supervised DCNNs and tested them on one of the most challenging visual recognition tasks, i.e., invariant object recognition. Below are short descriptions of all the DCNNs that we studied in this work.

Krizhevsky et al. 2012. This outstanding model reached an impressive performance on the Imagenet database and significantly outperformed its competitors in the ILSVRC-2012 competition. The excellent performance of this model attracted attention towards the abilities of DCNNs and opened a new avenue for further investigations. Briefly, the model contains five convolutional (feature detector) and three fully connected (classification) layers. The authors used Rectified Linear Units (ReLUs) for the neurons’ activation function, which significantly speeds up the learning phase. The max pooling operation is performed in the first, second, and fifth convolutional layers. This model is trained using a stochastic gradient descent algorithm. It has about 60 million free parameters; to avoid overfitting, the authors used data augmentation techniques to enlarge the training set, as well as the dropout technique in the learning procedure of the first two fully connected layers. The structural details of this model are presented in Table 1. We used the pre-trained version of this model (on the Imagenet database), which is publicly released at http://caffe.berkeleyvision.org by Jia et al.
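As a small illustration of two operations named above, the following is a minimal sketch of a ReLU activation and of max pooling along one axis. It is not the authors' or Caffe's implementation; the function names are hypothetical, and the pooling stride of 2 is an assumption (Table 1 only states a pooling window of 3).

```python
from typing import List, Sequence


def relu(x: float) -> float:
    """Rectified linear unit: pass positive inputs, clamp the rest to zero."""
    return x if x > 0 else 0.0


def max_pool_1d(values: Sequence[float], window: int = 3, stride: int = 2) -> List[float]:
    """Max pooling along one axis.

    window=3 follows the "x3 Pool" entries of Table 1; stride=2 is an
    assumption made for this sketch only.
    """
    if window > len(values):
        raise ValueError("pooling window is larger than the input")
    last_start = len(values) - window
    return [max(values[i:i + window]) for i in range(0, last_start + 1, stride)]


# Toy activations: apply ReLU element-wise, then pool.
activations = [relu(v) for v in [-1.0, 3.0, 2.0, 5.0, -4.0, 0.0, 6.0]]
pooled = max_pool_1d(activations)
```

With a window of 3 and a stride of 2, neighbouring pooling regions overlap by one element, so each output summarizes a slightly shifted neighbourhood of the input.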

Zeiler and Fergus 2013. To better understand the functions of the different layers in Krizhevsky’s model, Zeiler and Fergus introduced a deconvolutional visualization technique that reconstructs the features learned by each neuron. This enabled them to detect and resolve deficiencies by optimizing the architecture and parameters of the Krizhevsky model. Briefly, the visualization showed that the neurons of the first two layers had mostly converged to extremely high and low frequency information. In addition, they detected aliasing artifacts caused by the large stride in the first convolutional layer. To resolve these issues, they reduced the first-layer filter size from 11 x 11 to 7 x 7, and decreased the stride of the convolution in the first layer from 4 to 2 (see Table 1). The results showed a reasonable performance improvement with respect to the Krizhevsky model. The structural details of this model are provided in Table 1. We used the Imagenet pre-trained version of the Zeiler and Fergus model available at http://libccv.org.
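To make the effect of filter size and stride concrete, here is a minimal sketch of the standard output-size formula for a strided convolution, applied to the first-layer settings listed in Table 1. The 224-pixel input side and the absence of padding are assumptions chosen for illustration, not values taken from this paper.

```python
def conv_output_size(input_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Number of filter positions along one axis of a strided convolution."""
    padded = input_size + 2 * padding
    if padded < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (padded - kernel) // stride + 1


# Hypothetical input side of 224 pixels, no padding (assumptions).
INPUT_SIDE = 224

# First-layer settings from Table 1.
krizhevsky_positions = conv_output_size(INPUT_SIDE, kernel=11, stride=4)
zeiler_fergus_positions = conv_output_size(INPUT_SIDE, kernel=7, stride=2)
```

Under these assumptions the smaller filter and stride roughly double the number of sampled positions per axis (54 versus 109), which is one way to read the reduction of aliasing described above.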

Overfeat 2014. The Overfeat model provides a complete system to do object classification and localization together. Overfeat has been proposed in two different types: the Fast model with eight layers and the Accurate model with nine layers. Although the number of free parameters in both types is nearly the same (about 145 million), there are about twice as many connections in the Accurate one. It has been shown that the Accurate model leads to a better performance on Imagenet than the Fast one. Moreover, after the training phase, to make decisions with optimal confidence and increase the final accuracy, the classification can be performed at different scales and positions. Overfeat has some important differences with other DCNNs: 1) there is no local response normalization, 2) the pooling regions are non-overlapping, and 3) the model has a smaller convolution stride (= 2) in the first two layers. The specifications of the Accurate version of the Overfeat model, which we used in this study, are presented in Table 1. Similarly, we used the Imagenet pre-trained model, which is publicly available at http://cilvr.nyu.edu/doku.php?id=software:Overfeat:start.

Hybrid-CNN 2014. The Hybrid-CNN model has been designed to do a scene-understanding task. This model was trained on 3.6 million images of 1183 categories, including 205 scene categories from the Places database and 978 object categories from the training data of the Imagenet database. The scene labeling, which consists of some fixed descriptions about the scene appearing in each image, was performed by a large number of Amazon Mechanical Turk workers. The overall structure of Hybrid-CNN is similar to the Krizhevsky model (see Table 1), but it is trained on a different dataset to perform a scene-understanding task. This model is publicly released at http://places.csail.mit.edu. Surprisingly, the Hybrid-CNN significantly outperforms the Krizhevsky model in different scene-understanding benchmarks, while the two perform similarly in different object recognition benchmarks.

Chatfield CNNs. Chatfield et al. did an extensive comparison among shallow and deep image representations. To this end, they proposed three different DCNNs with different architectural characteristics, each exploring a different accuracy/speed trade-off. All three models have five convolutional and three fully connected layers but with different structures. The Fast model (CNN-F) has smaller convolutional layers and the convolution stride in the first layer is 4, versus 2 for CNN-M and CNN-S, which leads to a higher processing speed in the CNN-F model. The stride and receptive field of the first convolutional layer are decreased in the Medium model (CNN-M), which was shown to be effective for the Imagenet database [16]. The CNN-M model also has a larger stride in the second convolutional layer to reduce the computation time. The Slow model (CNN-S) uses 7 x 7 filters with a stride of 2 in the first layer and a larger max pooling window in the third and fifth convolutional layers. All these models were trained over the Imagenet database using a gradient descent learning algorithm. The training phase was performed over random crops sampled from the whole image rather than only the central region. Based on the reported results, the performance of the CNN-F model was close to the Zeiler and Fergus model, while both CNN-M and CNN-S outperformed the Zeiler and Fergus model. The structural details of these three models are also presented in Table 1. All these models are available at http://www.robots.ox.ac.uk/~vgg/software/deep_eval.

Very Deep 2014. Another important aspect of DCNNs is the number of internal layers, which influences their final performance. Simonyan and Zisserman [32] have studied the impact of the network depth by implementing deep convolutional networks with 11, 13, 16, and 19 layers. To this end, they used very small (3 x 3) convolution filters in all layers, and steadily increased the depth of the network by adding more convolutional layers. Their results indicate that the recognition accuracy increases by adding more layers and that the 19-layer model significantly outperformed other DCNNs. They have shown that their 19-layer model, trained on the Imagenet database, achieved high performance on other datasets without any fine-tuning. Here we used the 19-layer model available at http://www.robots.ox.ac.uk/~vgg/research/very_deep/. The structural details of this model are provided in Table 1.

Shallow models 

HMAX model. The HMAX model [32] has a hierarchical architecture, largely inspired by the simple-to-complex cell hierarchy in the primary visual cortex proposed by Hubel and Wiesel. The input image is first processed by the S1 layer (first layer), which extracts edges of different orientations and scales. Complex C1 units pool the outputs of S1 units in restricted neighborhoods and adjacent scales in order to increase position and scale invariance. Simple units of the next layers, including S2, S2b, and S3, integrate the activities of retinotopically organized afferent C1 units with different orientations. The complex units C2, C2b, and C3 pool over the output of the corresponding simple layers, using a max operation, to achieve global position and scale invariance. The HMAX model employed here was implemented by Jim Mutch et al. and is freely available at http://cbcl.mit.edu/jmutch/cns/hmax/doc/.

Pixel representation. The pixel representation is simply constructed by vectorizing the gray values of all the pixels of an image. Then, these vectors are given to a linear SVM classifier to do the categorization.

Image generation 

All models were evaluated using an image database divided into five categories (airplane, animal, car, motorcycle, and ship) and seven levels of variations (see Fig. 1). The process of image generation is similar to Ghodrati et al. Briefly, we built object images with different variation levels, where objects varied across five dimensions, namely: size, position (x and y), rotation in-depth, rotation in-plane, and background. To generate object images under different variations, we used 3-D computer models (3-D object images). Variations were divided into seven levels, from no object variations (level 1) to mid- and high-level variations (level 7). In each level, random values were sampled from uniform distributions for every dimension. After sampling these random values, we applied them to the 3-D object model and generated a 2-D object image by taking a snapshot of the varied 3-D model. We performed the same procedure for all levels and objects. Note that the magnitude of variations in every dimension was randomly selected from uniform distributions that were restricted to predefined levels (i.e., from level 1 to 7). For example, in level three a random value between 0° and 30° was selected for in-depth rotation, a random value between 0° and 30° was selected for in-plane rotation, and so on (see Fig. 1). The size of the 2-D images was 300 x 400 pixels. As shown in Fig. 1, for the different dimensions, a higher variation level has broader variation intervals than the lower levels. There were on average 16 3-D image exemplars per category. All 2-D object images were then superimposed onto randomly selected natural images for the experiments with natural backgrounds. There were over 3,900 natural images collected from the web, consisting of a variety of indoor and outdoor scenes.

Table 1: The architecture and settings of different layers of DCNN models. Each row of the table refers to a DCNN model and each column contains the details of a layer. The details of convolutional layers (labeled as Conv) are given in three parts, separated by semicolons: the first one indicates the number and the size of the convolution filters as Num x Size x Size; the second one gives the convolution stride; and the third one indicates the max pooling down-sampling rate, and whether Local Response Normalization (LRN) is used. The details of fully connected layers (labeled as Full) are presented in two parts: the first one indicates the number of neurons, and the second one whether dropout or softmax operations are applied.

| Model | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 | Layer 7 | Layer 8 | Layer 9 | Layer 10 |
| Krizhevsky et al. 2012 | Conv; 96 x 11 x 11; Stride 4; LRN, x3 Pool | Conv; 256 x 5 x 5; Stride 1; LRN, x3 Pool | Conv; 384 x 3 x 3; Stride 1 | Conv; 384 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1; x3 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - | - |
| Zeiler and Fergus 2013 | Conv; 96 x 7 x 7; Stride 2; LRN, x3 Pool | Conv; 256 x 5 x 5; Stride 2; LRN, x3 Pool | Conv; 384 x 3 x 3; Stride 1 | Conv; 384 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1; x3 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - | - |
| OverFeat 2014 | Conv; 96 x 7 x 7; Stride 2; x3 Pool | Conv; 256 x 7 x 7; Stride 1; x2 Pool | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 | Conv; 1024 x 3 x 3; Stride 1 | Conv; 1024 x 3 x 3; Stride 1; x3 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - |
| Hybrid-CNN 2014 | Conv; 96 x 11 x 11; Stride 4; LRN, x3 Pool | Conv; 256 x 5 x 5; Stride 1; LRN, x3 Pool | Conv; 384 x 3 x 3; Stride 1 | Conv; 384 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1; x3 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1183; softmax | - | - |
| CNN-F 2014 | Conv; 64 x 11 x 11; Stride 4; LRN, x2 Pool | Conv; 256 x 5 x 5; Stride 1; LRN, x2 Pool | Conv; 256 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1; x2 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - | - |
| CNN-M 2014 | Conv; 96 x 7 x 7; Stride 2; LRN, x2 Pool | Conv; 256 x 5 x 5; Stride 2; LRN, x2 Pool | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1; x2 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - | - |
| CNN-S 2014 | Conv; 96 x 7 x 7; Stride 2; LRN, x3 Pool | Conv; 256 x 5 x 5; Stride 1; x2 Pool | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1; x3 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - | - |

| Model | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Layer 6 | Layer 7 | Layer 8 | Layer 9 | Layer 10 |
| Very Deep 2014 | Conv; 64 x 3 x 3; Stride 1 | Conv; 64 x 3 x 3; Stride 1; x2 Pool | Conv; 128 x 3 x 3; Stride 1 | Conv; 128 x 3 x 3; Stride 1; x2 Pool | Conv; 256 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1 | Conv; 256 x 3 x 3; Stride 1; x2 Pool | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 |

| Model | Layer 11 | Layer 12 | Layer 13 | Layer 14 | Layer 15 | Layer 16 | Layer 17 | Layer 18 | Layer 19 | - |
| Very Deep 2014 (continued) | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1; x2 Pool | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1 | Conv; 512 x 3 x 3; Stride 1; x2 Pool | Full; 4096; dropout | Full; 4096; dropout | Full; 1000; softmax | - |
Psychophysical experiments 

In total, 26 human subjects participated in a rapid invariant object categorization task (17 males and 9 females, age 21-32, mean age of 26 years). Each trial started with a black fixation cross presented for 500 ms. Then an image was randomly selected from a pool of images and was presented at the center of the screen for 25 ms (two frames on an 80 Hz monitor). The image was followed by a uniform blank screen presented for 25 ms, as an inter-stimulus interval (ISI). Immediately afterwards, a 1/f noise mask was presented for 100 ms to account for feed-forward processing and minimize the effects of back projections from higher visual areas. This type of masking is well established in rapid object recognition tasks. Finally, subjects had to select one out of five different categories using five labeled keys on the keyboard. The next trial started immediately after the key press. Stimuli were presented using the MATLAB Psychophysics Toolbox on a 21” CRT monitor with a resolution of 1024 x 724 pixels, a frame rate of 80 Hz, and a viewing distance of 60 cm. Each stimulus covered 10° x 11° of visual angle. Subjects were instructed to respond as fast and accurately as possible. All subjects voluntarily agreed to participate in the experiment and gave their written consent. The experimental procedure was approved by the local ethics committee.

Figure 1: Sample object images from the database superimposed on randomly selected natural backgrounds. There are five object categories, each divided into seven levels of variations. Each 2-D image was rendered from a 3-D computer model. There were, on average, 16 different 3-D computer models for each object category. Objects vary in five dimensions: size, position (x, y), rotation in-depth, rotation in-plane, and background. To construct each 2-D image, we first randomly sampled from five different uniform distributions, each corresponding to one dimension. Then, these values were applied to the 3-D computer model, and a 2-D image was generated. Variation levels start from no variations (Level 1, first column at left; note the values on the horizontal axis) to high variation (Level 7, last column at right). For half of the experiments, objects were superimposed on randomly selected natural images from a large pool of natural images (3,900 images) downloaded from the web.

According to the “interruption theory”, the visual system processes stimuli sequentially, so processing of a new stimulus (the noise mask) will interrupt the processing of the previous stimulus (the object image) before it can be modulated by the feedback signals from higher areas. In our experiment, there is a 50 ms Stimulus Onset Asynchrony (SOA) between the object image and the noise mask (25 ms for image presentation and 25 ms for ISI). This SOA can disrupt IT-V4 (~40-60 ms) and IT-V1 (~80-120 ms) feedback signals, while it leaves the feed-forward information sweep intact [33]. Using Transcranial Magnetic Stimulation, it has been shown that applying magnetic pulses between 30 and 50 ms after stimulus onset will disturb the feed-forward visual information processing in the visual cortex. Thus, SOAs shorter than 50 ms would make the categorization task much harder by interrupting the feed-forward information flow.
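The timing arithmetic of a trial can be checked with a short sketch. The durations and the 80 Hz refresh rate come from the text; the helper name and the rule that every duration must be a whole number of frames are assumptions of this illustration, not a description of the Psychophysics Toolbox script.

```python
REFRESH_HZ = 80  # monitor frame rate stated in the text


def frames_for(duration_ms: float, refresh_hz: int = REFRESH_HZ) -> int:
    """Convert a duration to a whole number of monitor frames.

    Raises if the duration cannot be shown exactly at this refresh rate.
    """
    frames = duration_ms * refresh_hz / 1000
    if not float(frames).is_integer():
        raise ValueError(f"{duration_ms} ms is not a whole number of frames")
    return int(frames)


# Trial schedule in milliseconds, as described in the text.
FIXATION_MS, IMAGE_MS, ISI_MS, MASK_MS = 500, 25, 25, 100

schedule_frames = {
    "fixation": frames_for(FIXATION_MS),
    "image": frames_for(IMAGE_MS),
    "isi": frames_for(ISI_MS),
    "mask": frames_for(MASK_MS),
}

# Stimulus onset asynchrony: image onset to mask onset.
soa_ms = IMAGE_MS + ISI_MS
```

At 80 Hz one frame lasts 12.5 ms, so the 25 ms image and the 25 ms ISI each take two frames and the mask onset follows the image onset by four frames.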

Experiments were held in two sessions: in the first one, the objects were presented with a uniform gray background, and in the second one, a randomly selected natural background was used. Some subjects completed two sessions while others only participated in one session, so that each session was performed by 16 subjects. Each experimental session consisted of four blocks, each containing 175 images (in total 700 images; 100 images per variation level, 20 images from each object category in each level). Subjects could rest between blocks for 5-10 minutes. Subjects performed a few training trials before starting the actual experiment (none of the images in these trials were presented in the main experiment). Feedback was shown to subjects during the training trials, indicating whether they responded correctly or not, but not during the main experiment.

Model evaluation 

Classification accuracy: To evaluate the classification accuracy of the models, we first randomly selected 600 images from each object category, variation level, and background condition (see Image generation section). Hence, we have 14 different datasets (7 variation levels x 2 background conditions), each of which consists of 3000 images (5 categories x 600 images). To compute the accuracy of each DCNN for a given variation level and background condition, we randomly selected two subsets of 1500 training images (300 images per category) and 750 testing images (150 images per category) from the corresponding image dataset. We then fed the pre-trained DCNN with the training and testing images and calculated the corresponding feature vectors for all layers. Afterwards, we used these feature vectors to train the classifier and compute the recognition accuracy of each layer. Here we used a linear SVM classifier (libSVM implementation [13], www.csie.ntu.edu.tw/~cjlin/libsvm) with optimized regularization parameters. This procedure was repeated 15 times (with different randomly selected training and testing sets) and the average and standard deviation of the accuracy were computed. This procedure was done for all models, levels, and layers.
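The repeated random-split protocol described above can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the linear SVM; it is not the libSVM classifier used in the study. The feature vectors are assumed to have been extracted already, and all function names are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def split_per_category(features: Dict[str, List[Sequence[float]]],
                       n_train: int, n_test: int,
                       rng: random.Random) -> Tuple[List[Sample], List[Sample]]:
    """Draw disjoint training and testing subsets from every category."""
    train, test = [], []
    for label, vectors in features.items():
        chosen = rng.sample(vectors, n_train + n_test)
        train += [(v, label) for v in chosen[:n_train]]
        test += [(v, label) for v in chosen[n_train:]]
    return train, test


def nearest_centroid_accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Stand-in classifier (NOT a linear SVM): assign to the closest class mean."""
    grouped: Dict[str, List[Sequence[float]]] = {}
    for vector, label in train:
        grouped.setdefault(label, []).append(vector)
    centroids = {label: [mean(col) for col in zip(*vectors)]
                 for label, vectors in grouped.items()}

    def predict(vector: Sequence[float]) -> str:
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(vector, centroids[c])))

    return sum(predict(v) == y for v, y in test) / len(test)


def repeated_evaluation(features, n_train=300, n_test=150, n_runs=15, seed=0):
    """Mean and standard deviation of accuracy over independent random splits."""
    rng = random.Random(seed)
    scores = [nearest_centroid_accuracy(
        *split_per_category(features, n_train, n_test, rng))
        for _ in range(n_runs)]
    return mean(scores), stdev(scores)
```

The defaults mirror the numbers in the text (300 training and 150 testing images per category, 15 runs); swapping in an actual linear SVM would only change the classifier function.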

For the HMAX and Pixel models, we first randomly selected 300 and 150 images (from each category and each variation level) as the training and testing sets, and then computed their corresponding features. The visual prototypes of the S2, S2b, and S3 layers of the HMAX model were randomly extracted from the training set, and the outputs of the C2, C2b, and C3 layers were used to compute the performance of the HMAX model. The pixel representation for each image is simply a vector of the pixels’ gray values. Finally, the feature vectors were given to a linear SVM classifier. The reported accuracies are the average of 15 independent random runs.

Confusion matrix: We also computed the confusion matrices for models and humans in all variation levels, both for objects on uniform and natural backgrounds. A confusion matrix allows us to determine which categories are more often misclassified and how classification errors are distributed across different categories. For the models, confusion matrices were calculated from the labels assigned by the SVM. To obtain the human confusion matrix, we averaged the confusion matrices of all human subjects.
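A minimal sketch of the confusion-matrix computation and of the averaging across subjects is given below. The row normalization (each row sums to one when the category was presented) is an assumption of this sketch, as are the function names.

```python
from typing import List, Sequence

CATEGORIES = ("airplane", "animal", "car", "motorcycle", "ship")


def confusion_matrix(true_labels: Sequence[str],
                     predicted_labels: Sequence[str],
                     categories: Sequence[str] = CATEGORIES) -> List[List[float]]:
    """Rows are true categories, columns are assigned categories.

    Each row is normalized by the number of presented images of that
    category (an assumption); rows of categories never shown stay at zero.
    """
    index = {c: i for i, c in enumerate(categories)}
    counts = [[0] * len(categories) for _ in categories]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def average_matrices(matrices: Sequence[List[List[float]]]) -> List[List[float]]:
    """Element-wise mean, e.g. over the matrices of individual subjects."""
    n = len(matrices)
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(size)]
            for i in range(size)]
```

For a model, `predicted_labels` would be the SVM outputs on the test set; for humans, one matrix per subject would be built from the key presses and then passed to `average_matrices`.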

Representational dissimilarity matrix (RDM)

Model RDM: The RDM provides a useful and illustrative tool to study the representational geometry of the responses to different images, and to check whether images of the same category generate similar responses in the representational space. Each element in an RDM shows the pairwise dissimilarity between the response patterns elicited by two images. Here these dissimilarities are measured using Spearman’s rank correlation distance (i.e., 1 - correlation). Moreover, RDMs are a useful tool to compare different representational spaces with each other. Here, we used RDMs to compare the internal representations of the models with human behavioral responses (see below). To calculate the RDMs, we used the RSA toolbox developed by Nili et al.
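The dissimilarity measure named above, 1 minus Spearman's rank correlation, can be sketched in plain Python as follows. This is an illustrative re-implementation, not the RSA toolbox code; the function names are hypothetical, and ties are handled with average ranks.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; undefined (raises) for constant vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_a * var_b)


def spearman_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - Spearman rank correlation, ranging from 0 to 2."""
    return 1.0 - pearson(average_ranks(a), average_ranks(b))


def rdm(patterns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Square matrix of pairwise dissimilarities between response patterns."""
    n = len(patterns)
    return [[0.0 if i == j else spearman_distance(patterns[i], patterns[j])
             for j in range(n)] for i in range(n)]
```

Each row of `patterns` would hold the response of one layer to one image, so the resulting matrix has one row and one column per image.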

Human RDM: Since we did not have access to the human internal object representations in our psychophysical experiment, we used the human behavioral scores to compute the RDMs (see [12] for more details). Specifically, for each image, we computed the relative frequencies with which the image was assigned to the different categories by all human subjects. Hence, we have a five-element vector for each image, which is used to construct the human RDM. Although computing human RDMs based on behavioral responses is not a direct measurement of the representational content of the human visual system, it provides a way to compare the internal representations of DCNN models to the behavioral decisions of humans.
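The five-element behavioral vector described above can be sketched as follows. The function name and the example responses are hypothetical; the category order is simply the order in which the categories are listed in the Image generation section.

```python
from collections import Counter
from typing import List, Sequence

CATEGORIES = ("airplane", "animal", "car", "motorcycle", "ship")


def response_frequencies(responses: Sequence[str],
                         categories: Sequence[str] = CATEGORIES) -> List[float]:
    """Relative frequency with which one image was assigned to each category."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    return [counts[c] / len(responses) for c in categories]


# Hypothetical responses of 16 subjects to a single car image.
example = response_frequencies(["car"] * 12 + ["ship"] * 4)
```

One such vector per image, compared pairwise with the same dissimilarity measure as for the models, would give a human RDM of the same size as the model RDMs.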
Figure 2: Classification accuracy of models and humans in the multiclass invariant object categorization task across seven levels of object variations. A. Accuracies when objects were presented on uniform backgrounds. Each colored curve shows the accuracy of one model (specified in the legend). The gray curve indicates human categorization accuracy across the seven levels. All models were well above chance level (20%). The right panel shows the accuracies of both models and humans at the last level of variations (level seven; marked with a pale red rectangle), in ascending order. Level seven is considered the most difficult level, as the variations are high at this level, making the categorization difficult for models and humans. The color-coded matrix, at the top right of the bar plot, shows the p-values for all pairwise comparisons between humans and models, computed using Wilcoxon rank sum tests. For example, the accuracy of the Hybrid-CNN was compared to that of humans and all other models, and each pairwise comparison yields a p-value. Blue points indicate that the accuracy difference is significant, while gray points show insignificant differences. The numbers written around the p-value matrix correspond to models (H stands for human). Accuracies are reported as the average and standard deviation of 15 random, independent runs. B. Accuracies when objects were presented on randomly selected natural backgrounds.


Results 

We tested the DCNNs in our invariant object categorization task including five object categories, seven variation levels, and two background conditions (see Materials and methods). The categorization accuracy of these models was compared with that of human subjects performing rapid invariant object categorization tasks on the same images. For each model, variation level, and background condition, we randomly selected 300 training images and 150 testing images per object category from the corresponding image dataset. The accuracy was then calculated over 15 random independent runs, and the average and standard deviation were reported. We also analyzed the error distributions of all models and compared them to those of humans. Finally, we compared the representational geometry of models and humans, as a function of the variation levels.

DCNNs achieved human-level accuracy

We compared the classification accuracy of the final layer of all models (DCNNs, and HMAX representation) with that of human subjects doing the invariant object categorization tasks in all variation levels and background conditions. Figure 2A shows that almost all DCNNs achieved human-level accuracy across all levels when objects had a uniform gray background. The accuracies of DCNNs are even better than those of humans at low (levels 1 to 3) and intermediate (levels 4 and 5) variation levels. This might be due to inevitable motor errors that humans made during the psychophysical experiment, meaning that subjects might have perceived the image correctly but pressed the wrong key. Also, it can be seen that the accuracies of humans and almost all DCNNs are virtually flat across all variation levels, which means they are able to invariantly classify objects with a uniform background. Surprisingly, the accuracy of Overfeat is far below the human-level accuracy, even worse than the HMAX model. This might be due to the structure and the number of features extracted by the Overfeat model, which leads to a more complex feature space with high redundancy.

We also compared the accuracy of humans and models at the most
difficult level (level 7). There is no significant difference between
the accuracies of CNN-S, CNN-M, Zeiler and Fergus, and humans at this
variation level (Fig. 2A, bar plot; see also the pairwise comparisons in
the p-value matrix, computed with the Wilcoxon rank sum test). CNN-S is
the best model.
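
As an illustration of these pairwise comparisons, the following sketch implements a two-sided Wilcoxon rank sum test with the normal approximation. It is a simplified stand-in, not the routine used to produce the p-value matrix: the function name is hypothetical, and no tie or continuity correction is applied. Each sample would hold the accuracies of one model (or of humans) at a given level.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (z, p). Tied values receive average ranks; no tie or
    continuity correction is applied, which is a simplification.
    """
    # Pool both samples, remembering which group (0 or 1) each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p
```

Calling `rank_sum_test` on every pair of accuracy samples and thresholding the resulting p-values gives a matrix of significant and non-significant differences of the kind shown next to the bar plots.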

When we presented object images superimposed on natural backgrounds,
the accuracies decreased for both humans and models. Figure 2B shows
that only three DCNNs (CNN-F, CNN-M, and CNN-S) performed close to
humans. The accuracy of the HMAX model dropped to just above chance
level (i.e., 20%) at the seventh variation level. Interestingly, the
accuracy of Overfeat remained almost the same for objects on uniform
and on natural backgrounds, suggesting that this model is more suitable
for tasks with unsegmented images. We again compared the accuracies at
the most difficult level (level 7), this time with natural backgrounds.
As before, there is no significant difference between the accuracies of
CNN-S, CNN-M, and humans (see the p-value matrix computed using the
Wilcoxon rank sum test for all possible pairwise comparisons). However,
the accuracy of human subjects is significantly higher than that of the
HMAX model and of the other DCNNs (i.e., CNN-F, Zeiler and Fergus,
Krizhevsky, Hybrid-CNN, and Overfeat).

How accuracy evolves across layers in 
DCNNs 

DCNNs have a hierarchical structure of processing stages in which each
layer extracts a large pool of features (e.g., > 4000 features at the
top layers). The computational load of such models is therefore very
high. This raises two important questions: what is the contribution of
each layer to the final accuracy, and how does the accuracy evolve
across the layers?

We addressed these questions by calculating the accuracy of each layer
of the models at all variation levels, which gives the contribution of
each layer to the final accuracy. Figure 3A-H shows the accuracies of
all layers and models when objects had a uniform gray background. The
accuracies of the Pixel representation (dashed, dark purple curve) and
of humans (gray curve) are also shown on each plot.

Overall, the accuracies changed significantly across the layers of the
DCNNs. Moreover, almost all layers of the models (except Overfeat), and
even the Pixel representation, achieved perfect accuracy at low
variation levels (i.e., levels 1 and 2), suggesting that the task is
very simple when objects have small variations and a uniform gray
background. At the intermediate and difficult variation levels, the
accuracies tend to increase from lower to higher layers. However, the
trend differs between layers and models. For example, layers 2, 3, and
4 in three DCNNs (Krizhevsky, Hybrid-CNN, Zeiler and Fergus) have very
similar accuracies across the variation levels (Fig. 3A, B, and G).
Similar results can be seen for these models in layers 5, 6, and 7
(Fig. 3A, B, and G). In contrast, there is a large increase in accuracy
from layer 1 to layer 4 for CNN-F, CNN-M, and CNN-S, while the last
three layers have similar accuracies. There is also a gradual increase
in the accuracy of Overfeat from layer 2 to layer 5 (with similar
accuracies for layers 6, 7, and 8); however, there is a considerable
decrease at the output layer (Fig. 3C). Moreover, the overall accuracy
of Overfeat is low compared to humans and other models, as previously
seen in Fig. 2.

Interestingly, the accuracy of HMAX, as a shallow model, is far below
the accuracies of the DCNNs (C2b is its best-performing layer). This
shows the important role of supervised deep learning in achieving high
classification accuracy. As expected, the accuracy of the Pixel
representation decreased exponentially, down to 30% at level seven,
confirming that invariant object recognition requires multi-layered
architectures (note that the chance level is 20%). We note, however,
that Pixel performs very well when there are no viewpoint variations
(level 1).

We also compared the accuracies of all layers of the models with those
of humans. The color-coded points at the top of each plot in Fig. 3
indicate the p-values of the Wilcoxon rank sum test. The average
accuracy of each layer across all variation levels is shown in the pink
area at the right side of each plot, summarizing the contribution of
each layer to the final accuracy independently of the variation level.
Horizontal lines in the pink area show whether the average accuracy of
each layer is significantly different from that of humans (black:
significant; white: not significant). Furthermore, Fig. 3I summarizes
the results shown in the pink areas, confirming that the last three
layers of the DCNNs (except Overfeat) have similar accuracies.

Figure 3: Classification accuracy of models (for all layers separately) and humans in the multiclass invariant object categorization
task across seven levels of object variations, when objects had uniform backgrounds. A. Accuracy of Krizhevsky et al. 2012 across
all layers and levels. Mean accuracies and s.e.m. are reported over 15 random, independent runs. Each colored curve shows the accuracy of
one layer of the model (specified in the bottom-left legend). The accuracy of the Pixel representation is shown as a dashed, dark purple curve.
The gray curve indicates human categorization accuracy across the seven levels. The chance level is 20%; no layer reached chance level in this task
(note that the accuracy of the Pixel representation dropped to 10% above chance at level seven). The color-coded points at the top of the
plot indicate whether there is a significant difference between the accuracy of humans and that of the model layers (computed using the Wilcoxon rank
sum test). Each color refers to a p-value range, specified at the top right (*: p < 0.05, **: p < 0.01, ***: p < 0.001, ****: p < 0.0001). Colored
circles in the pink area show the average accuracy of each layer across all variation levels (one value per layer), with the
same color code as the curves. The horizontal lines underneath the circles indicate whether the difference between human accuracy (gray
circle) and that of each layer is significant (computed using the Wilcoxon rank sum test; black line: significant, white line: not significant). B-H.
Accuracies of Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler and Fergus, and the HMAX model, respectively. I. The average accuracy
across all levels for each layer of each model (error bars are again s.e.m.). Each curve corresponds to a model. This panel simply summarizes the
accuracies shown in the pink areas. The shaded areas show the average baseline accuracy (pale purple, Pixel representation) and human
accuracy (gray) across all levels.

Figure 4: Classification accuracy of models (for all layers separately) and humans in the multiclass invariant object categorization
task across seven levels of object variations, when objects had natural backgrounds. A-H. Accuracies of Krizhevsky et al.,
Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler and Fergus, and the HMAX model across all layers and variation levels, respectively. I.
The average accuracy across all levels for each layer of each model (error bars are again s.e.m.). Details of the diagrams are explained in the caption
of Fig. 3.

We also tested the models on objects with natural backgrounds to see
whether the contributions of similarly performing layers change in a
more challenging task. Not surprisingly, the accuracy of human subjects
dropped by 10% at the low variation level (level 1), and down to 25% at
the high variation level (level 7), with respect to the uniform
background case (Fig. 4, gray curve). As expected, the Pixel
representation shows an exponential decline in accuracy across the
levels, reaching chance accuracy at level seven (Fig. 4, dashed dark
purple curve). As in Fig. 3, all DCNNs except Overfeat achieved close to
human-level accuracy at low variation levels (levels 1, 2, and 3).
Interestingly, the Pixel representation performed better than most
models at level one, suggesting that object categorization at a low
variation level can be done without elaborate feature extraction (note
that we had only five object categories; this could be different with
more categories).

Figure 5: Classification accuracy at easy (level 1), intermediate (level 4) and difficult (level 7) levels for different layers of the
models. A-C. Accuracy for different layers at the easy (A), intermediate (B) and difficult (C) levels when objects had uniform backgrounds. Each
curve represents the accuracy of a model. The shaded areas show the accuracy of the Pixel representation (pale purple) and of humans (gray).
Error bars are standard deviations. D-F. Idem when objects had natural backgrounds.

The severe drop in the accuracy of the HMAX model with respect to the
uniform background experiment reflects the difficulty this model has in
coping with distractors in natural backgrounds. In both background
conditions, the C2b layer has higher accuracy than the C3 layer and
tolerates object variations better. The main reason why HMAX does not
perform as well as DCNNs is probably the lack of a purposive learning
rule gSlEI]. HMAX randomly extracts a large number of visual features
(image crops), which could be highly redundant, uninformative, and even
misleading H. The issue of inappropriate features becomes more evident
when the background is cluttered.

Another noticeable fact about DCNNs in the natural background
experiment is the superiority of the last convolutional layers over the
fully connected layers; for example, the accuracy of the fifth layer in
the Krizhevsky model is higher than that of the seventh layer. One
possible reason for the low accuracies of the final layers of DCNNs is
that the fully connected layers are designed to perform classification
themselves, not to provide input to an SVM classifier. In addition, the
fully connected layers were optimized for ImageNet classification, not
for our dataset. A last reason could be that the convolutional layers
have more features than the fully connected layers.

The accuracies of all layers show that accuracy evolved across the
layers. However, as in Fig. 3, layers 2, 3, and 4 of Krizhevsky, Zeiler
and Fergus, and Hybrid-CNN contribute almost equally to the final
accuracy. Again, CNN-F, CNN-M, and CNN-S showed a different trend in
terms of the contribution of each layer to the final accuracy.
Moreover, as shown in Fig. 4D-F, only these three models achieved
human-level accuracy at the difficult levels (levels 6 and 7). The
accuracies of the other DCNNs, however, are significantly lower than
those of humans at these levels (see the color-coded points in
Fig. 4A-C, G, which indicate the p-values computed by the Wilcoxon rank
sum test). We summarized the average accuracies across all levels for
each layer of the models, shown as color-coded circles with error bars
in the pink areas next to each plot. In most cases, layer 5 (the last
convolutional layer; layer 6 in Overfeat) has the highest accuracy
among the layers. This is summarized in Fig. 4I, which collects the
results shown in the pink areas. Figure 4I also confirms that only
CNN-F, CNN-M, and CNN-S achieve human-level accuracy.

Figure 6: Confusion matrices for the multiclass invariant object categorization task. A. Each color-coded matrix shows the confusion
matrix of a model when categorizing the different object categories (specified in the first matrix at the top-left corner), when images had uniform
backgrounds. Each row corresponds to a model; the last row shows the human confusion matrix. Each column corresponds to a particular level of variation
(levels 1 to 7). Model names are shown at the right end. B. Idem with natural backgrounds. The color bar at the top right shows the
percentage of labels assigned to each category; the chance level is indicated with an arrow. Confusion matrices were calculated only for the
last layers of the models.

We further compared the accuracies of all layers of the models with
those of humans at the easy (level 1), intermediate (level 4) and
difficult (level 7) variation levels, to see how each layer performs
the task as the level of variation increases. Figure 5A-C shows the
accuracies for the uniform background condition. The easy level is not
very informative because of a ceiling effect: all models (except
Overfeat) reach 100% accuracy. At the intermediate level, all DCNNs
(except Overfeat) reached human-level accuracy from layer 4 upwards
(Fig. 5B), suggesting that DCNNs remain remarkably accurate even at an
intermediate level of variation (note that objects had a uniform
background). This is clearly not true for the HMAX and Overfeat
networks. However, when models were fed with images from the most
difficult level, only the last layers (layers 5, 6, and 7) achieved
human-level accuracy (see Fig. 5C). Notably, the last three layers have
nearly identical accuracies.

When objects had natural backgrounds, somewhat surprisingly, the
accuracy of all DCNNs (except Overfeat) at the easy level is maximal at
layer 2 and drops for subsequent layers. This shows that deeper is not
always better. The fact that the Pixel representation performs well at
this level confirms this finding. At the intermediate level, the
picture is different: only the last three layers of the DCNNs,
excluding Overfeat, reach human-level accuracy (see Fig. 5E). Finally,
at the seventh variation level, Fig. 5F shows that only three DCNNs
reach human performance: CNN-F, CNN-M, and CNN-S.

In summary, the above results, taken together, illustrate that some
DCNNs are as accurate as humans, even at the highest variation levels.

Do DCNNs and humans make similar 
errors? 

The accuracies reported in the previous section only represent the
ratio of correct responses; they do not reflect whether models and
humans made similar misclassifications. For a more precise,
category-based comparison between the recognition performance of humans
and models, we computed the confusion matrices for each variation
level. Figure 6 provides the confusion matrices for humans and the last
layers of all models for both uniform (see Fig. 6A) and natural (see
Fig. 6B) backgrounds, and for each variation level.

Despite the very short presentation time in the behavioral experiment,
humans performed remarkably well at categorizing the five object
classes, both when objects had uniform (Fig. 6A, last row) and natural
(Fig. 6B, last row) backgrounds, with very few misclassifications
across categories and levels. It is, however, important to point out
that the majority of human errors were ship-airplane confusions. This
was probably due to the shape similarity between these objects (e.g.,
both categories usually have bodies, sails, wings, etc.).

Figure 6 demonstrates that the HMAX model and the Pixel representation
misclassified almost all categories at high variation levels. With
natural backgrounds, they assigned input images uniformly to the
different classes. Conversely, DCNNs show few classification errors
across categories and levels, though the distribution of errors differs
from one model to another. For example, the majority of recognition
errors made by Krizhevsky, Zeiler and Fergus, and Hybrid-CNN belonged
to the car and motorcycle classes, while the animal and airplane
classes were the ones most often misclassified by CNN-F, CNN-M, and
CNN-S. Finally, Overfeat shows evenly distributed errors across
categories, consistent with its low accuracy.

We also examined whether the models’ decisions are similar to those of
humans. To this end, we computed the similarity between the human
confusion matrices and those of the models. An important point is to
factor out the effect of the mean accuracies (of humans and models) on
the similarity measure, so that only the error distributions are taken
into account. Therefore, for each confusion matrix, we first excluded
the diagonal terms, arranged the remaining elements in a vector, and
normalized it by its L2 norm. The similarity between two confusion
matrices is then computed as one minus the Euclidean distance between
their corresponding vectors (we call this measure 1 - normalized
Euclidean distance). In this way, we compare only the error
distributions of humans and models, independently of their accuracies.
Figure 7 provides the similarities between models and humans across all
layers and levels when objects had a uniform background. Almost all
models, including the Pixel representation, show the maximum possible
similarity at low variation levels (levels 1 and 2). However, the
similarity of the Pixel representation decreases exponentially from
level 2 upwards. Overall, the highest layers of the DCNNs (except
Overfeat) are the most similar to human decisions. This point is also
shown in Figure 7I, which represents the average similarities across
all variation levels (each curve corresponds to one model). Note that,
given the high recognition accuracies in the uniform background
condition, this level of similarity was to be expected.
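
The similarity measure described above can be written compactly. The sketch below follows the stated steps (drop the diagonal, L2-normalize, subtract the Euclidean distance from one). The function names are hypothetical, and the handling of an error-free confusion matrix (mapped to the zero vector) is an assumption made here, since the text does not specify that case.

```python
import math


def error_vector(confusion):
    """Off-diagonal entries of a square confusion matrix, L2-normalized."""
    n = len(confusion)
    v = [confusion[i][j] for i in range(n) for j in range(n) if i != j]
    norm = math.sqrt(sum(a * a for a in v))
    # Assumption: a matrix without any error maps to the zero vector.
    return [a / norm for a in v] if norm > 0 else v


def confusion_similarity(conf_a, conf_b):
    """1 - Euclidean distance between the normalized error vectors."""
    return 1.0 - math.dist(error_vector(conf_a), error_vector(conf_b))
```

Because the error vectors have unit norm and non-negative entries, the distance lies between 0 and sqrt(2). The similarity is therefore 1 for identical error distributions and about -0.41 when the two sets of errors do not overlap at all. Under the assumption above, two error-free matrices also yield a similarity of 1.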

The similarity between models’ and humans’ errors, however, decreases
for images with natural backgrounds. The HMAX model had the lowest
similarity with humans (see Fig. 8). Although DCNNs reached human-level
accuracy, their decisions and error distributions differ from those of
humans. Interestingly, Overfeat has an almost constant similarity
across layers and levels. Comparing the similarities across DCNNs shows
that CNN-F, CNN-M, and CNN-S have the highest similarities to humans,
which is also reflected in Fig. 8I.

To summarize our results so far: the best DCNNs can reach human
performance even at the highest variation level, but their error
distributions differ from the average human one (similarity < 1 in
Fig. 8). However, a reference is needed here, because humans also
differ from each other. Are these differences between humans smaller
than the differences between humans and DCNNs? To investigate this
issue, we used multidimensional scaling (MDS) to visualize the
distances (i.e., similarities) between the confusion matrices of humans
and models (last layer) in 2-D maps (see Figure 9). Each map
corresponds to a certain variation level and background condition.

Figure 7: Similarity between models’ and humans’ confusion matrices when images had uniform backgrounds. A. Similarity
between the confusion matrices of Krizhevsky et al. 2012 and those of humans (measured as 1 - normalized Euclidean distance). Each curve shows the
similarity between the human confusion matrix and that of one layer of Krizhevsky et al. 2012 (specified in the right legend), across the different levels of
variation. The similarity between the confusion matrices of the Pixel representation and of humans is shown as a dark purple, dashed line. B-H.
Idem for the Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler and Fergus, and HMAX models, respectively. I. The average similarity
across all levels for each layer of each model (error bars are s.e.m.). Each curve corresponds to one model.

In the uniform background condition, humans have small inter-subject
distances. As we move from low to high variations, the distance between
DCNNs and humans increases. At high variation levels, the Overfeat,
HMAX, and Pixel models are very far from the human subjects as well as
from the other DCNNs. The other models remain indiscernible from
humans.

In the natural background condition, the human between-subject
distances are larger than in the uniform condition. As the level of
variation increases, the models tend to move further away from the
human subjects. However, CNN-F, CNN-M, and CNN-S are difficult, if not
impossible, to discern from humans.


So far, we have analyzed the accuracies and error distributions of
models and humans when features were fed to an SVM classifier. However,
such analyses do not inform us about the internal representational
geometry of the models and its similarity to that of humans. It is
therefore important to investigate how different categories are
represented in the feature space.

Figure 8: Similarity between models’ and humans’ confusion matrices, when object images had natural backgrounds. A-H.
Similarities between the confusion matrices of Krizhevsky, Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler and Fergus, and the HMAX model
and those of humans. Figure conventions are identical to Fig. 7. I. The average similarity across all levels for each layer of each model (error
bars are s.e.m.). Each curve corresponds to a model.


Representational geometry of models 
and human 

Representational similarity analysis has become a popular tool to
study the internal representation of models |20l HU |271 08] in
response to different object categories. The representational
geometries of models can then be compared with neural responses
independently of the recording modality (e.g. fMRI [m [20], cell
recording [HI 07] [27], behavior [50] EU |52l [19], and MEG [^),
showing to what degree each model resembles the brain representations.
Here, we calculated representational dissimilarity matrices (RDMs) for
models and humans [H]. We then compared the RDMs of humans and each
model and quantified the similarity between the two. Model RDMs were
calculated based on the pairwise correlation between the feature
vectors of two images (see Materials and methods). To calculate the
human RDM, we used the behavioral scores recorded in the psychophysical
experiment (see Materials and methods as well as [IS]).
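
A model RDM of this kind can be sketched as follows, assuming that the dissimilarity between two images is 1 - Spearman's rank correlation of their feature vectors, as stated in the caption of Fig. 10. The helper names are hypothetical and this is not the toolbox code used for the reported results. Feature vectors are assumed to be non-constant, since the correlation is undefined for a constant vector.

```python
def ranks(values):
    """Average 1-based ranks, so that tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in range(i, j):
            out[order[k]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return out


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def rdm(features):
    """RDM[i][j] = 1 - Spearman correlation of feature vectors i and j."""
    ranked = [ranks(f) for f in features]  # Spearman = Pearson on ranks
    n = len(ranked)
    return [[1.0 - pearson(ranked[i], ranked[j]) for j in range(n)]
            for i in range(n)]
```

With 20 images per category and five categories, `features` holds 100 vectors and the result is a 100 x 100 matrix whose entries range from 0 (identical rank order) to 2 (reversed rank order).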

Figure 10 shows the RDMs for models and humans across the different
levels of variation, both for objects on uniform (Fig. 10A) and on
natural (Fig. 10B) backgrounds. Note that these RDMs are calculated
from the object representations in the last layers of the models. For
better visualization, we show only 20 images from each category; the
size of the RDMs is therefore 100 x 100 (the reported RDMs were
averaged over six random runs).


As expected, the human RDM clearly represents each object category,
with minimum intra-class dissimilarity and maximum inter-class
dissimilarity, across all variation levels (last row in Fig. 10A and
Fig. 10B for uniform and natural backgrounds, respectively). However,
both HMAX and the Pixel representation show a random pattern in their
RDMs when objects had natural backgrounds (Fig. 10B, rows 8 and 9),
suggesting that such low- and intermediate-level visual features are
unable to represent the different object categories invariantly. The
situation is slightly better when objects had a uniform background
(Fig. 10A, rows 8 and 9). In this case, there is some categorical
information, mostly at low variation levels (levels 1 to 3, and 4 to
some extent), for animal, motorcycle, and airplane images. Such
information is attenuated at intermediate and high variation levels.

Figure 9: The distances between models and humans, visualized using the multidimensional scaling (MDS) method, when images had
uniform (A) and natural (B) backgrounds. Light gray circles show the position of each human
subject and the larger black circle shows the average of all subjects. Colored circles represent models.

In contrast, DCNNs show clear categorical information for the
different objects across almost all levels, in both background
conditions. Categorical information is more evident when objects had a
uniform background, even at high variation levels, while it almost
disappears at intermediate levels when objects had natural backgrounds.
In addition, Overfeat did not clearly represent the different object
categories. The Overfeat model is one of the most powerful DCNNs, with
high accuracy on the ImageNet database, but its features do not seem
suitable for our invariant object recognition task. It uses no fewer
than 230400 features! This might be one reason for its poor
representational power: it probably leads to a nested and complex
object representation. This high number of features may also explain
the poor classification performance we obtained, through overfitting.

Figure 10: Representational Dissimilarity Matrices (RDMs) for models and humans. RDMs for humans and models when images
had uniform (A) and natural (B) backgrounds. Each element in a matrix shows the pairwise dissimilarity between the internal representations
of two images (measured as 1 - Spearman’s rank correlation). Each row of RDMs corresponds to a model (specified on the right) and each
column corresponds to a particular level of variation (from level 1 to 7). The last row shows the human RDMs, calculated from the behavioral
responses. The color bar in the top-right corner shows the degree of dissimilarity. For the sake of visualization, we only included 20 images
from each category, leading to 100 x 100 matrices. Model RDMs were calculated for the last layer of each model.

Based on visual inspection, some DCNNs seem better at representing
specific categories. For example, Krizhevsky, Hybrid-CNN, and Zeiler
and Fergus better represent the animal, car and airplane classes (lower
within-class dissimilarity for these categories), while the ship and
motorcycle classes are better represented by CNN-F, CNN-M, and CNN-S.
Interestingly, this was also reflected in the confusion matrix
analysis, suggesting that combining and remixing features from these
DCNNs could result in a more robust invariant object representation PI.

To quantify the similarity between models’ and humans’ RDMs, we
calculated the correlation between them across all layers and levels
(measured as Kendall rank correlation). Each panel in Fig. 11 and
Fig. 12 represents the correlation between the RDMs of one model and of
humans across all layers and variation levels (each color-coded curve
corresponds to one layer), when objects had uniform and natural
backgrounds, respectively. Overall, as shown in these figures, the
correlation coefficients are high at low variation levels but decrease
at higher levels. Moreover, correlations are not significant at very
difficult levels, as indicated by the color-coded points at the top of
each plot (blue point: significant, gray point: not significant).
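
The RDM comparison can be illustrated with the sketch below. It assumes the tau-a variant of Kendall's rank correlation named in the caption of Fig. 11, and it compares only the entries above the diagonal, which is an assumption made here on the grounds that RDMs are symmetric with a zero diagonal. The function names are hypothetical. The implementation is quadratic in the number of entries and is meant for clarity rather than speed.

```python
from itertools import combinations


def upper_triangle(matrix):
    """Entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def kendall_tau_a(a, b):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    score = 0
    for (a1, b1), (a2, b2) in combinations(list(zip(a, b)), 2):
        product = (a1 - a2) * (b1 - b2)
        score += (product > 0) - (product < 0)  # tied pairs contribute 0
    n = len(a)
    return score / (n * (n - 1) / 2)


def rdm_correlation(rdm_a, rdm_b):
    """Kendall tau-a between the upper triangles of two RDMs."""
    return kendall_tau_a(upper_triangle(rdm_a), upper_triangle(rdm_b))
```

Unlike other variants, tau-a does not rescale for ties, so a model RDM that predicts many equal dissimilarities cannot reach a correlation of 1 with a human RDM that has none.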

Interestingly, comparing the uniform (Fig. 11) and natural (Fig. 12)
background cases indicates that the maximum correlation (~ 0.3 at level
1) did not change much. However, in the uniform background condition,
the correlation at the other levels increased to some extent. It can
also be seen that, with uniform backgrounds, the correlations of the
HMAX model and the Pixel representation are higher and more significant
than with natural backgrounds (Fig. 11 and Fig. 12). Note that the
correlation values of the first layer of almost all DCNNs (except
Zeiler and Fergus) are similar to those of the Pixel representation,
suggesting that in the absence of viewpoint variations, very simple
features (i.e., gray values of pixels) can achieve acceptable accuracy
and correlation. This means that DCNNs are built to perform more
complex recognition tasks, as has been shown in several studies.

Figure 11: Correlation between humans’ and models’ RDMs, across different layers and levels, when objects had uniform
backgrounds. A. Correlation between the human RDM and the Krizhevsky et al. 2012 RDM (Kendall τa rank correlation), across different layers
and levels of variation. Each color-coded curve shows the correlation of one layer of the model (specified in the right legend) with the
corresponding human RDM. The correlation of the Pixel representation with the human RDM is shown as a dashed, dark purple curve. The
color-coded points at the top of the plots indicate whether the correlation is significant. Blue points indicate a significant correlation while gray
points indicate a non-significant correlation. Correlation values are the average over 10,000 bootstrap resamples. Error bars are the standard deviation.
B-H. Idem for Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler and Fergus, and HMAX, respectively. I. The average correlation across
all levels for each layer of each model (error bars are STD). Each curve corresponds to one model. The shaded area shows the average correlation
for the Pixel representation across all levels. All correlation values were calculated using the RSA toolbox (Nili et al., 2014).

Not surprisingly, with natural backgrounds, the correlation between the
Pixel and human RDMs is very low and almost never significant at any
level (Fig. 12, dashed dark purple line copied on all panels).
Similarly, the HMAX model shows a very low and non-significant
correlation across all layers and levels. We also expected a low
correlation for the Overfeat model, as shown in Fig. 12C.
Interestingly, the correlation increases as images are processed
through consecutive layers of the DCNNs, with lower correlations at
early layers and higher correlations at top layers (layers 5, 6, and
7). As for the accuracy results, the correlations of the fully
connected layers of DCNNs are very similar to each other, suggesting
that these layers do not add much to the final representation.

Figure 12: Correlation between humans’ and models’ RDMs, across different layers and levels, when objects had natural
backgrounds. A-H. Correlation between the human RDM and those of Krizhevsky, Hybrid-CNN, Overfeat, CNN-F, CNN-M, CNN-S, Zeiler
and Fergus, and HMAX, across all layers and levels. Figure conventions are identical to Fig. 11. I. The average correlation across all levels for
each layer of each model (error bars are STD).

We summarized the correlation results in Fig. 11I
and Fig. 12I by averaging the correlation
coefficients across levels for every model layer.
The correlations for DCNNs evolve across layers,
with low correlations at early layers and high
correlations at top layers. Moreover, Fig. 11I
shows that the correlation of the HMAX model (all
layers) with humans fluctuates around the
correlation of the Pixel representation
(indicated by the shaded area).

Note that although the correlation coefficients
are not very high (~0.2), the Zeiler and Fergus,
Hybrid-CNN, and Krizhevsky models are the most
human-like. It is worth noting that the best
models in terms of performance, CNN-F, CNN-M, and
CNN-S, do not have the most human-like RDMs.
Conversely, the model with the most human-like
RDM, Zeiler and Fergus, is not the best in terms
of classification performance.

More research is needed to understand why the
RDM of the Zeiler and Fergus model is
significantly more human-like than those of other
DCNNs. This finding is consistent with a previous
study by Cadieu et al. [27], in which the Zeiler
and Fergus RDM was found to be more similar to
the monkey IT RDM than those of the Krizhevsky
and HMAX models.

We also computed the category separability index
for the internal representations of each model as
the ratio of within-category to between-category
dissimilarities (results not shown). This
analysis also confirms that models with higher
separability indexes do not necessarily perform
better than other models. In fact, it is the
actual positions of images of different
categories in the representational space that
determine the final accuracy of a model, not just
the mean inter- and intra-class distances.
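
The separability index described above can be illustrated with a minimal sketch. The exact formula is not given in this section, so the version below assumes the simplest reading: the mean within-category dissimilarity divided by the mean between-category dissimilarity, computed over all unordered image pairs of an RDM. The function name, the toy RDM, and the labels are hypothetical.

```python
from itertools import combinations

def separability_index(rdm, labels):
    """Mean within-category dissimilarity divided by mean
    between-category dissimilarity (assumed definition; under this
    reading, lower values mean better separated categories)."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        # Each unordered image pair is counted exactly once.
        if labels[i] == labels[j]:
            within.append(rdm[i][j])
        else:
            between.append(rdm[i][j])
    return (sum(within) / len(within)) / (sum(between) / len(between))

# Hypothetical 4-image RDM: images 0-1 share a category, as do 2-3.
toy_rdm = [
    [0.0, 0.2, 0.8, 1.0],
    [0.2, 0.0, 0.9, 0.9],
    [0.8, 0.9, 0.0, 0.4],
    [1.0, 0.9, 0.4, 0.0],
]
toy_labels = [0, 0, 1, 1]
index = separability_index(toy_rdm, toy_labels)
print(round(index, 3))
```

Because the index only depends on two averages, two representations with the same index can still place individual images very differently relative to a decision boundary, which is the point made in the paragraph above.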

A very deep network 

In previous sections we studied different DCNNs,
each having 8 or 9 layers with 5 or 6
convolutional layers, from various perspectives
and compared them with the human feed-forward
object recognition system. Here, we assess how
exploiting many more layers could affect the
performance of DCNNs. To this end, we used the
Very Deep CNN [32], which has no fewer than 19
layers (16 convolutional and 3 fully connected
layers). We extracted features of layers 9 to 18
from images with natural backgrounds to
investigate whether the additional layers of the
Very Deep CNN affect the final accuracy and
human-likeness.

Figure 13A illustrates that the classification
accuracy tends to improve as images are processed
through consecutive layers. The accuracies of
layers 9, 10, and 11 are almost the same, but the
accuracy gradually increases over the next layers
and culminates in layer 16 (the topmost
convolutional layer), which significantly
outperforms humans even at the highest variation
level (see the color-coded circles above this
figure). Here again, the accuracy drops in the
fully connected layers, which are optimized for
Imagenet classification. Nevertheless, the
accuracies of the highest layer (layer 18) are
still higher than those of humans for all
variation levels.

Figure 13B shows the accuracies of the last and
best-performing layers of all models in
comparison with humans for the highest variation
level (level 7) in the natural background task.
The color-coded matrix on the right shows the
p-values for all pairwise comparisons between
models and humans, computed with the Wilcoxon
rank sum test. The Very Deep CNN significantly
outperforms all other DCNNs in both cases. The
best-performing layer of this model also
significantly outperforms humans. However, the
accuracies of all other DCNNs are below that of
humans, and the gap is significant for all models
except CNN-S and CNN-M.
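
For readers unfamiliar with the Wilcoxon rank sum test used for these pairwise comparisons, the following is a minimal sketch of the two-sided test with a normal approximation. It is not the authors' code: the function name is hypothetical, the variance is not corrected for ties, and the accuracy values are made-up numbers chosen only to exercise the function.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation,
    no tie correction of the variance). Returns (z, p)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Tied values share the average of the ranks they span.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Made-up per-run accuracies for one model layer and for humans.
model_runs = [0.81, 0.83, 0.80, 0.84, 0.82]
human_runs = [0.72, 0.75, 0.70, 0.74, 0.73]
z, p = rank_sum_test(model_runs, human_runs)
print(z, p)
```

A p-value matrix like the one in Fig. 13B would be obtained by calling such a test for every pair of models (and humans); with many pairs, a correction for multiple comparisons may be appropriate, although this section does not say whether one was applied.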

We also computed the RDMs of the Very Deep model
for all variation levels and layers 9 to 18 in
the natural background condition (see Fig. 14).
Calculating the correlations between the model’s
and humans’ RDMs shows that the last three layers
had the highest correlations with the human RDM
(see Fig. 13C). The correlation values of the
other layers decrease drastically, down to 0.05,
indicating that these layers are less robust to
object variations than the last layers. However,
the statistical analysis demonstrates that almost
all correlation values are significant (see the
color-coded points above the plot), suggesting
that although the similarity between the RDM of
humans and those of the Very Deep model’s layers
is small, it is not random but statistically
meaningful. Hence, it can be said that the layers
of the Very Deep CNN process images in a somewhat
human-like way. Finally, Fig. 13D compares the
correlation values between the RDM of humans and
those of the last as well as the best-correlated
layers of all DCNNs in the natural background
condition. As can be seen, the Very Deep CNN and
Zeiler and Fergus models have the highest
correlation values in both cases, with a large
statistical difference compared to the other
models.
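
The RDM analysis above can be sketched in a few lines. The paper computed these values with the RSA toolbox; the plain-Python version below is only an illustration under stated assumptions: each RDM entry is 1 - r with r being Spearman's rank correlation between two feature vectors (as in the Fig. 14 caption), and two RDMs are compared by Spearman correlation of their upper triangles (an assumption, since the comparison measure is not stated in this section). All function names and the toy data are hypothetical.

```python
import math

def _ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rank correlation (Pearson correlation of ranks).
    Undefined (division by zero) if either input is constant."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)

def rdm(features):
    """Pairwise dissimilarity (1 - Spearman r) between feature vectors."""
    n = len(features)
    return [[0.0 if i == j else 1.0 - spearman(features[i], features[j])
             for j in range(n)] for i in range(n)]

def upper_triangle(matrix):
    return [matrix[i][j] for i in range(len(matrix))
            for j in range(i + 1, len(matrix))]

def rdm_similarity(rdm_a, rdm_b):
    """Spearman correlation between the upper triangles of two RDMs."""
    return spearman(upper_triangle(rdm_a), upper_triangle(rdm_b))

# Toy layer responses for three images, and a made-up reference RDM.
layer_features = [[1, 2, 3, 4], [2, 3, 4, 5], [4, 3, 2, 1]]
model_rdm = rdm(layer_features)
reference_rdm = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8], [0.9, 0.8, 0.0]]
similarity = rdm_similarity(model_rdm, reference_rdm)
print(similarity)
```

The bootstrap estimate mentioned in the Fig. 13 caption would repeat `rdm_similarity` on resampled sets of images and report the mean and standard deviation of the resulting values.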


Discussions 

Invariant object recognition has always been a
demanding task in computer vision, yet a
two-year-old child performs it with ease.
However, the emergence of novel learning
mechanisms and computational models in recent
years has opened new avenues for solving this
highly complex task.
Figure 13: The accuracy and human-likeness of the Very Deep CNN with natural backgrounds. A. Classification accuracy of the 
Very Deep CNN (layers 9-18) and humans across the seven levels of object variations. Each colored curve shows the accuracy of one layer of the 
model. The accuracy of the Pixel representation is depicted using a dashed, dark purple curve. The gray curve indicates human categorization 
accuracy across the seven levels. The color-coded points at the top of the plot indicate whether there is a significant difference between the 
accuracy of humans and each layer of the model (Wilcoxon rank sum test). Each color refers to a p-value, specified on the top-right (*: p < 0.05, 
**: p < 0.01, ***: p < 0.001, ****: p < 0.0001). We plot the mean accuracies +/- STD over 15 runs. Colored circles with error bars, on 
the pink area show the average accuracy of each layer across all variation levels (mean +/- STD). The horizontal lines underneath the circles, 
indicate whether the difference between human accuracy (gray circle) and each layer of the model is significant (Wilcoxon rank sum test; black 
line: significant, white line: insignificant). B. Top: the accuracy comparison between the best-performing layer in each model and humans at 
the last variation level (level 7). The color-coded matrix, on the right of the bar plot, shows the p-values for all pairwise comparisons between 
humans and models (Wilcoxon rank sum test). Numbers, written around the p-value matrix, correspond to models (H stands for human). 
Bottom: idem with the last layers. C. Correlation between humans and the Very Deep CNN RDMs, across different layers (layers 9-18) and 
levels. Each color-coded curve shows the correlation of one layer of the model with corresponding human RDM. The color-coded points at the 
top of the plot indicate whether the correlation is significant (Blue: significant; Gray: insignificant). Correlation values are the average over 
10,000 bootstrap resamples +/- STD. D. Top: correlations between the most correlated layer in each model and humans at the last variation 
level (level 7). P-value matrix was calculated using similar approach to B. Bottom: idem with the last layers. 


DCNNs have been shown to be a novel and powerful
approach to tackle this problem [IHl EH ESI
[Ti)l ESI EZl EHl EH EH ISD]. These networks have
drawn scientists’ attention not only in vision
sciences, but also in other fields of science
(see [55]), as a powerful solution for many
complex problems.


DCNNs are among the most powerful computing
models inspired by computations performed in
neural circuits. Of particular interest here,
recent studies have also confirmed the abilities
of DCNNs in object recognition problems (e.g. ca,
EH, and [61]). Besides, several studies have
tried to compare the responses of DCNNs and
primate visual cortex in different object
recognition tasks.

Figure 14: Representational Dissimilarity Matrices (RDMs) of the Very 
Deep model (layers 9 to 18) for different levels of variation (levels 
1-7) in the natural background condition. Each element in a matrix shows 
the pairwise dissimilarity between the representations of two images 
(measured as 1 - r, where r is Spearman’s rank correlation; see 
Materials and Methods). The color bar at the top-right corner shows the 
degree of dissimilarity. The size of each matrix is 100 x 100, with 20 
images from each category. This was done for the sake of better visualization. 

Khaligh-Razavi and Kriegeskorte 123 compared 
the representational geometry of neuronal re¬ 
sponses in human (fMRI data; see jH]) and mon¬ 
key IT cortex (cell recording; see [39]) with several 
computational models, including one DCNN, on a 
96-image dataset. They showed that supervised 
DCNNs can explain IT representation. However, 
firstly, their image database only contained
frontal views of objects with no viewpoint
variation. Secondly, the number and variety of
images were very low (only 96 images) compared to
the wide variety of complex images in natural
environments. Finally, the images had a uniform
gray background, which is very different from
natural vision. To overcome these issues, Cadieu
et al. [27] used a large image database,
consisting of different categories, backgrounds,
and transformations, and compared the
categorization accuracy and representational
geometry of three DCNNs with neural responses in
monkey IT and V4. They showed that DCNNs closely
resemble the responses of IT neurons in terms of
both accuracy and geometry [27] [37]. One issue
in their study is the long stimulus presentation
time (100 ms), which might be too long to account
only for feed-forward processing. Moreover, they
included only three DCNNs in their study. In
another attempt, Güçlü et al. [2H] mapped
different layers of a DCNN onto the human visual
cortex. More specifically, they computed the
representational similarities between different
layers of a DCNN and the fMRI data from different
areas of the human visual cortex. Although these
studies have shown the power of several DCNNs in
object recognition, new DCNNs are being developed
quickly, which calls for continuous assessment of
recent DCNNs using different techniques.
Moreover, the ability of DCNNs to tolerate object
variations (mostly 3-D variations) had not been
carefully evaluated before.
Here, we comprehensively tested eight best per¬ 
forming DCNNs, reported in the literature usmsi 
imiiHiEii l32] . in a very challenging vision task, 
namely invariant object recognition. These DCNNs
have shown remarkable accuracies in the
classification of large and challenging image
databases such as Imagenet, VOC 2007, and
Caltech 205. Moreover, we compared the DCNNs with
human subjects performing the same task with the
same images to investigate the extent to which
DCNNs resemble humans.

DCNNs achieve human-level performance
in a rapid invariant object
recognition task

Humans are very fast and accurate at categorizing
objects 01621163]. Numerous studies have
investigated this remarkable performance under
ultra-rapid image presentation [631 ESI ES]. It
is believed that rapid object categorization is
mainly performed by the feed-forward information
flow through the ventral visual pathway [671163].
Experimental and theoretical evidence suggests
that feed-forward processing is able to perform
invariant object recognition [HliaiSl[7|. Here,
we measured human accuracy when categorizing five
object categories in a rapid presentation
paradigm. Objects varied in six dimensions and
the task difficulty was controlled using seven
variation levels. Results showed that humans
achieved high accuracy across all levels (under
2-D and 3-D variations) even though objects were
presented for only 25 ms.

Using the same image database, we also evaluated
eight state-of-the-art DCNNs [ISl UHl UTl UHl
[SI], largely inspired by the feed-forward
processing of the visual cortex. Results
indicated that these DCNNs can mimic human
accuracy (see Fig. |^ to Fig. |^). However, the
HMAX model, one of the early successful models,
showed very poor performance in almost all
experiments. We also showed in our previous study
that such shallow feed-forward models fail to
achieve human-level accuracy in invariant object
categorization [19].

We further performed a layer-specific analysis to
investigate how accuracy and representational
geometry evolve across consecutive layers in
DCNNs. Results illustrated that accuracies tend
to increase as images are processed through the
layers; however, some layers achieved very
similar accuracies. If some layers do not
contribute considerably to the final accuracy, at
least in our task, one is tempted to remove them
to reduce the computational load of the DCNN,
which is typically very high. For example, it has
been shown that eliminating one of the middle
layers leads to just a 2% accuracy drop in the
Krizhevsky model on the Imagenet database [T^.
More research is needed to systematically
evaluate the role of different layers by removing
each layer and evaluating the resulting accuracy.
However, this should be done using different
image databases, since these DCNNs were optimized
for the Imagenet database and the layer-specific
effect might therefore be database dependent.

The layer-specific analysis is interesting as it
shows that not only the accuracy, but also the
representational geometry evolves through the
layers. To our knowledge, only one study [20] has
investigated the layer-specific responses in one
DCNN. A possible future study would be to compare
the responses of several visual cortical areas
with different layers of DCNNs, as this would
help to understand what is missing in models and
layers. Cadieu et al. [27] compared the responses
of monkey IT and V4 neurons with the penultimate
layer of three DCNNs, but they did not test, for
example, how V4 responses are correlated with
other layers.

RDMs (Fig. and confusion matrices (Fig. 
of the last layer of DCNNs demonstrated that
increasing the level of object variations can
disturb object representations and increase the
misclassification rate, but less so for the
higher layers. Conversely, for low variation
levels, shallow models actually outperform both
deeper ones and humans. This means that, even if
deep nets have attracted a lot of attention
recently, deeper is not always better. To
classify images with weak viewpoint variations
(e.g. passport photos), a shallow model might
lead to the best performance. In addition, its
computational load will be much lower, and
training will require far fewer labeled examples.

It is possible, and even likely, that having
incongruent backgrounds can affect human accuracy
in some cases. However, we ran the exact same
experiments with uniform backgrounds. This helped
us to find an upper bound for human performance
(see Fig. |^). Even in this case, models can
reach human-level accuracy. Moreover, since both
humans and DCNNs saw the objects in a congruent
context during development, eliminating the
contextual information in the background, or
using an incongruent background, presumably
affects humans and models similarly.

In summary, our results demonstrate the ability
of DCNNs to reach human (feed-forward vision)
accuracy in invariant object recognition. This
confirms the success of these computational
models in mimicking the performance of the visual
neural circuits in such a difficult task. When
the variation level is high, shallow networks
have low accuracies, while as we move through the
layers of DCNNs the invariance gradually
increases, in such a way that the Very Deep
network (with 19 layers) can even outperform
humans. Another important point is that both 2-D
and 3-D variations could be handled by 2-D
features extracted through the layers of DCNNs.
Although some 2-D variations, such as position,
are handled through many convolutional layers
(using shared-weight filters in different
positions), DCNNs do not have any built-in
mechanism to overcome 3-D variations (such as
in-depth rotation). Thus, these invariances must
be learned. Regarding the different theories of
how the brain reaches 3-D invariance, our results
suggest that 3-D rotation invariance can be
achieved using 2-D features and not necessarily
by the construction of 3-D object models.
However, the differences between the error
distributions and object representations of
DCNNs and humans suggest that they use different
information to handle invariant object
recognition, presumably due to structural and
learning differences. The human visual system
exploits feedback signals, bottom-up and top-down
attention, continuous visual information, and
temporal learning. So if using more layers can
substantially improve the performance of machine
vision algorithms, adding other properties of the
visual system could bring further advances. This
could, in turn, give important clues about the
nature of neural processing in the visual cortex.
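
The remark above about shared-weight filters and position can be made concrete with a toy example. The sketch below is not taken from any of the evaluated networks: it uses a hypothetical 1-D filter and input, applies the same filter at every position, and then takes a global maximum. Shifting the pattern shifts the filter responses by the same amount, and the pooled maximum is unchanged.

```python
def conv1d(signal, kernel):
    """'Valid' 1-D cross-correlation: the same kernel weights are
    applied at every position of the signal (weight sharing)."""
    k = len(kernel)
    return [sum(signal[i + t] * kernel[t] for t in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [1, -1, 1]                    # hypothetical feature detector
a = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0]     # pattern "1 0 1" at position 2
b = [0, 0, 0, 0, 0, 1, 0, 1, 0, 0]     # same pattern shifted by 3

resp_a = conv1d(a, kernel)
resp_b = conv1d(b, kernel)

# The responses are shifted copies of each other, and global max
# pooling over positions removes the dependence on position.
print(resp_a, max(resp_a))
print(resp_b, max(resp_b))
```

No comparable architectural shortcut exists for in-depth rotation: a rotated object produces a different 2-D pattern rather than a shifted copy, so tolerance to such changes has to come from learning, as argued in the paragraph above.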

Network architecture plays a very 
important role 

Here, we evaluated several DCNNs with different
architectures and training sets, which led to
different accuracies. Zeiler and Fergus, CNN-M,
and CNN-S achieved higher accuracies than the
Krizhevsky model, while using smaller receptive
fields and a smaller stride in the first
convolutional layer. Besides, CNN-M and CNN-S
outperformed Zeiler and Fergus using more
convolutional features in layers 3, 4, and 5.
Nevertheless, Overfeat, which exploits
considerably more features in these layers, had
trouble with invariant object recognition.
Interestingly, the Very Deep CNN, which
significantly outperforms all models as well as
humans, has about twice as many convolutional
layers as the other DCNNs but smaller (3 x 3)
receptive fields.
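
One way to see how small filters and greater depth can go together is simple arithmetic on stacked convolutions. The sketch below is an illustration, not an analysis from this study: it assumes stride-1 layers with square filters, equal numbers of input and output channels, and no biases, and the channel count of 64 is a hypothetical value.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field (in input pixels) of one output unit after
    stacking stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)

def stacked_weight_count(kernel_size, num_layers, channels):
    """Number of weights (biases ignored) when every layer maps
    `channels` input maps to `channels` output maps."""
    return num_layers * kernel_size * kernel_size * channels * channels

# Three stacked 3 x 3 layers cover the same 7 x 7 input window as a
# single 7 x 7 layer, with fewer weights and two extra nonlinearities.
print(stacked_receptive_field(3, 3), stacked_weight_count(3, 3, 64))
print(stacked_receptive_field(7, 1), stacked_weight_count(7, 1, 64))
```

Under these assumptions, a deep stack of small filters sees as much of the image as a shallower stack of large filters, which is consistent with, but does not by itself explain, the accuracy differences reported here.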

Although it is not clear why some DCNNs perform
better than others, our results suggest that
networks with deeper architectures, and
convolutional layers with small filter sizes but
more feature planes, can achieve higher
performance. In any case, extensive optimization
is required to find the best architecture and
parameter settings for DCNNs. It is also
important to point out that, although they use
similar architectures but different training
datasets, the Krizhevsky and Hybrid-CNN models
had close performances. These results suggest
that architecture is more important than the
training set. Hence, future studies should focus
on how to evaluate different architectures to
find the optimum one.

DCNNs lack important processing 
mechanisms that exist in biological 
vision 

We tried to only allow feed-forward processing in
our psychophysical experiment by using a short
presentation time and backward masking, weakening
the effect of back projections. However, this
does not completely rule out the effects of
feedback connections in the visual system.
Conversely, DCNNs are feed-forward-only models
without any feedback mechanisms from upper to
lower layers (note that error backpropagation is
not considered a feedback mechanism because it
only occurs during learning, not recognition).
Adding a feedback mechanism to DCNNs could
increase their performance, and this could be
useful for complex visual tasks (e.g., variation
level 7 in our data). This would inevitably
increase the computational load of DCNNs, which
might be the reason why DCNNs still lack a
feedback mechanism. Another issue is how to learn
feedback connections.

In addition to object recognition, feedback
connections play a pivotal role in other visual
processes such as figure-ground segregation
[SHI EH], spatial and feature-based attention
uoi, and perceptual learning mi. As shown in our
results, the accuracies of DCNNs drop
significantly for objects with natural
backgrounds. This could be due to the lack of
figure-ground segregation in the models. Indeed,
the primate visual system is able to separate the
parts of the image that belong to the target
object from the background and other objects. It
has been suggested that recurrent processing is
required for the completion of figure-ground
segregation (see [HH] and (SH]). Also, the
mechanisms of bottom-up and top-down attention in
the human visual system emphasize the most
salient and relevant parts of the images, which
contain more information and can facilitate the
categorization process. Several studies
I121IZ21IZ3] have shown that recurrent processing
can enhance object representations in IT and
facilitate invariant object recognition. DCNNs
lack such mechanisms, which could help to
increase the recognition accuracy, especially in
cluttered images; this could be another direction
for future improvement of DCNNs.

Future directions 

Our image database has several advantages for
studying invariant object recognition. Firstly,
it contains a large number of object images,
changing across different levels of variation in
position, scale, in-depth and in-plane rotation,
and background. Secondly, we had precise control
over the amount of variation, which let us
generate images with different degrees of
complexity/difficulty and thereby enabled us to
scrutinize the behavior of humans and
computational models as the complexity of object
variations gradually increases. Thirdly, similar
to several studies [la EH [711 [75], by
eliminating dependencies between objects and
backgrounds, we were able to study invariance
independently of contextual effects.

However, there are several influential parameters
in invariant object recognition for both humans
and models that should be further investigated.
It is important to explore how the consistency
between objects and the surrounding environment
would affect the object recognition process
[76] [77] [78] [79], particularly in invariant
object recognition. Also, other parameters such
as illumination, contrast, texture, noise, and
occlusion need to be investigated in controlled
experiments.

Another important question that needs to be
clearly addressed is whether all types of
variation impose the same difficulty on humans
and models. A simple and short answer is “No”;
however, it remains unclear which types of
variation are more challenging and what the
underlying mechanisms are. It has been shown that
the brain responds differently to different types
of object variation. For instance,
scale-invariant responses appear faster than
position-invariant ones [80]. Interestingly,
scale-invariant responses in the human brain
emerge early in development while view-invariant
responses tend to emerge later, suggesting that
simple processes such as scale invariance could
be built-in, while we would need more training to
perform view-invariant object recognition p].
Therefore, it is important, for both
neuroscientists and computational modelers, to
understand how the brain deals with different
types of variation. From a computer vision point
of view, it seems that 3-D variations (e.g.,
rotations in depth) are more challenging than 2-D
transformations (e.g., changes in position and
scale) [221 EH ST]. Due to the structure of DCNNs
and the computations performed in such networks,
they easily cope with changes in position and, to
some extent, in the scale of the objects.
However, there is no built-in mechanism for
invariance to 3-D transformations. Adding such a
mechanism to the models should increase their
accuracy as well as their resemblance to
neurophysiological data. A very recent modeling
study [82], inspired by physiological data from
the monkey brain, shows that adding a view
invariance mechanism to a feed-forward model can
surprisingly explain face processing in monkey
face patches [83l [8l|.